122 research outputs found

    Colour Text Segmentation in Web Images Based on Human Perception

    There is a significant need to extract and analyse the text in images on Web documents, for effective indexing, semantic analysis and even presentation by non-visual means (e.g., audio). This paper argues that the challenging segmentation stage for such images benefits from a human perspective of colour perception in preference to RGB colour space analysis. The proposed approach enables the segmentation of text in complex situations such as in the presence of varying colour and texture (characters and background). More precisely, characters are segmented as distinct regions with separate chromaticity and/or lightness by performing a layer decomposition of the image. The method described here is a result of the authors' systematic approach to approximate the human colour perception characteristics for the identification of character regions. In this instance, the image is decomposed by performing histogram analysis of Hue and Lightness in the HLS colour space and merging using information on human discrimination of wavelength and luminance.
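    The histogram step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the bin counts and the choice to skip achromatic pixels in the hue histogram are assumptions made for illustration, and the perceptual merging stage is not shown.

```python
import colorsys

def hue_lightness_histograms(pixels, hue_bins=12, light_bins=8):
    """Build separate Hue and Lightness histograms in HLS space.

    pixels: iterable of (r, g, b) tuples with 0-255 integer channels.
    hue_bins and light_bins are illustrative defaults, not values from the paper.
    """
    hue_hist = [0] * hue_bins
    light_hist = [0] * light_bins
    for r, g, b in pixels:
        # colorsys works on floats in [0, 1] and returns (hue, lightness, saturation)
        h, l, s = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
        light_hist[min(int(l * light_bins), light_bins - 1)] += 1
        # Hue is undefined for achromatic (grey) pixels, so count it only when s > 0
        if s > 0:
            hue_hist[min(int(h * hue_bins), hue_bins - 1)] += 1
    return hue_hist, light_hist

# Example: an orange-red pixel, a blue pixel and a light grey pixel
hue_hist, light_hist = hue_lightness_histograms(
    [(255, 64, 0), (0, 64, 255), (200, 200, 200)]
)
```

    Peaks in such histograms would be candidates for separate layers; how neighbouring peaks are merged depends on the human discrimination data the abstract refers to.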

    Crowdsourcing historical tabular data : 1961 census of England and Wales

    This paper describes how crowdsourcing can be incorporated as an integral part of a comprehensive technical workflow to identify, extract and validate data from large volumes of printed tabular statistics, and transform them into operable digital datasets using current structural and descriptive standards. The recently completed digitisation project for the 1961 Census of England and Wales (commissioned by the UK's Office for National Statistics) is used to provide details on data processing, crowdsourcing platform and tasks, crowd interaction, and validation of results. The multi-modal approach employed was very successful, delivering far more complete and validated data than automated processes alone could produce (due to the challenging nature of the source material).

    Efficient and effective OCR engine training

    We present an efficient and effective approach to train OCR engines using the Aletheia document analysis system. All components required for training are seamlessly integrated into Aletheia: training data preparation, the OCR engine's training processes themselves, text recognition, and quantitative evaluation of the trained engine. Such a comprehensive training and evaluation system, guided through a GUI, allows for iterative incremental training to achieve best results. The widely used Tesseract OCR engine is used as a case study to demonstrate the efficiency and effectiveness of the proposed approach. Experimental results are presented validating the training approach with two different historical datasets, representative of recent significant digitisation projects. The impact of different training strategies and training data requirements is presented in detail.

    Highlights of the novel dewaterability estimation test (DET) device

    Many industries that produce sludge in large quantities depend on sludge dewatering technology to reduce its water content. A key design parameter for dewatering equipment is the capillary suction time (CST), measured with a test that is practical and easy to perform but has several scientific flaws. The standard CST test has considerable drawbacks: a lack of reliability and difficulty in obtaining results for heavy sludge types. Furthermore, it is not designed for long experiments (e.g. >30 min), and has only two measurement points (its two electrodes). In comparison, the novel dewaterability estimation test (DET) is almost as simple as the CST test, but considerably more reliable, faster, more flexible and more informative, thanks to the wealth of visual measurement data collected with modern image analysis software. The standard deviation associated with repeated measurements of the same sludge is lower for the DET than for the CST test. In contrast to the CST device, capillary suction in the DET is linear rather than radial, allowing for a straightforward interpretation of findings. The new DET device may replace the CST test in the sludge-producing industries in the future.

    ICFHR 2018 Competition on recognition of historical Arabic scientific manuscripts - RASM2018

    This paper presents an objective comparative evaluation of page analysis and recognition methods for historical scientific manuscripts with text in Arabic language and script. It describes the competition (modus operandi, dataset and evaluation methodology) held in the context of ICFHR2018, presenting the results of the evaluation of six methods: three submitted and three baseline systems. The challenges for the participants included page segmentation, text line detection, and optical character recognition (OCR). Different evaluation metrics were used to gain an insight into the algorithms, including new character accuracy metrics to better reflect the difficult circumstances presented by the documents. The results indicate that, despite the challenging nature of the material, useful digitisation outputs can be produced.
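    The abstract does not define the new character accuracy metrics used in the competition. As background only, the sketch below shows the conventional edit-distance-based character accuracy that such metrics usually build on; the function names are hypothetical and this is not the RASM2018 evaluation code.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def character_accuracy(ground_truth, ocr_output):
    """Conventional character accuracy: 1 - errors / ground-truth length, floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))

# Example: one substituted character in a four-character ground truth
accuracy = character_accuracy("abcd", "abxd")
```

    Because Python strings are sequences of code points, the same functions apply to Arabic text, although a real evaluation would first need to settle on a normalisation of diacritics and ligatures.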